Machine learning techniques for forecasting agricultural prices: A case of brinjal in Odisha, India

Background Price forecasting of perishable crops like vegetables has important implications for farmers, traders and consumers. Timely and accurate price forecasts help farmers switch between alternative nearby markets to sell their produce and obtain good prices. Farmers can also use the information to decide on the timing of marketing. Several statistical models have been applied in the past for forecasting prices of agricultural commodities, but those models have their own limitations in terms of assumptions. Methods In recent times, Machine Learning (ML) techniques have been highly successful in modeling time series data. Although numerous empirical studies have shown that ML approaches outperform time series models in forecasting time series, their application in forecasting vegetable prices in India is scarce. In the present investigation, an attempt has been made to explore efficient ML algorithms, e.g. Generalized Regression Neural Network (GRNN), Support Vector Regression (SVR), Random Forest (RF) and Gradient Boosting Machine (GBM), for forecasting the wholesale price of brinjal in seventeen major markets of Odisha, India. Results An empirical comparison of the predictive accuracies of the different models with that of the usual stochastic model, i.e. the Autoregressive integrated moving average (ARIMA) model, is carried out, and it is observed that ML techniques, particularly GRNN, perform better in most of the cases. The superiority of the models is established by means of the Model Confidence Set (MCS) and other accuracy measures such as Mean Error (ME), Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). In addition, the Diebold-Mariano test is performed to test for significant differences in the predictive accuracy of different models. Conclusions Among the machine learning techniques, GRNN performs better in all the seventeen markets as compared to other techniques. RF performs at par with GRNN in four markets. The accuracies of other techniques such as SVR, GBM and ARIMA are not up to the mark.


Introduction
Agriculture plays a vital role in the Indian economy. Over 70 per cent of rural households depend on agriculture, which contributes about 20% to the total GDP and provides employment to over 60% of the population. Indian agriculture has registered impressive growth over the last few years [1], ensuring a continuous supply of agricultural commodities. The horticulture sector encompasses a wide range of crops such as fruits, vegetables, flowers, spices, plantation crops like coconut, beverages like tea and coffee, and some medicinal and aromatic plants. Statistics provided by the National Horticulture Board show that India accounts for 57.31% of the total production of vegetables and 6.92% for brinjal (Horticultural Statistics at a Glance 2018). Brinjal is one of the most common tropical vegetables grown in India. It is a very nutritive vegetable: 100 g of brinjal provides 52.0 mg of chlorine, 47.0 mg of phosphorus, 44.0 mg of sulphur, 6.4 mg of vitamin A, 18.0 mg of calcium, 24 kcal of energy, 1.3 g of fiber, 0.9 mg of iron, 1.4 g of protein, 12.0 mg of vitamin C and 18.0 mg of oxalic acid [2]. Odisha ranks fourth at the national level in vegetable production. Brinjal is a native of India, is cultivated on a large scale across many states and is consumed by almost all households. In 2017-18, as per the records of the National Horticulture Database, the area under brinjal production was 1.17 lakh ha, with production of 20.13 lakh tonnes and productivity of 17.07 mt/ha. As far as brinjal production is concerned, with a 15.75% share of national production, Odisha ranks second after West Bengal (Fig 1). Marketing decisions could be enriched with correct price forecasts for maximizing returns and reducing risk.
Price and arrival information improves the bargaining position of farmers and improves competition between traders. When price information is available, the farmer is in a better position to switch between alternative nearby markets to dispose of the produce and obtain good prices. Farmers can also use the information to decide on the timing of marketing. Consequently, erratic price variations should be reduced as arbitrage over time and space becomes easier and more widespread [3]. A significant characteristic of vegetable price series is seasonality, which is the biggest obstacle to obtaining accurate forecasts of vegetable prices. Given the complexity of the price series, many models have been specified for capturing the behavior of vegetable prices, but researchers have not reached a consensus on the best model for vegetable prices [4].
Within the time series framework, many linear and nonlinear approaches have been developed, such as the Autoregressive integrated moving average (ARIMA) model, the Seasonal ARIMA (SARIMA) model and the Generalized autoregressive conditional heteroscedastic (GARCH) model. In the past, many studies have been conducted with the objective of predicting agricultural commodity prices [5][6][7][8][9]. It is reported that the SARIMA model outperforms other price forecasting models for forecasting onion prices in Mumbai markets [5]. An application of the SARIMA model for forecasting meat exports from India can be found in [6]. Price volatility in agricultural commodities in India has been extensively studied in [7]. Forecasting of the retail price of Arhar Dal in the Karnal market of Haryana has been carried out using stochastic models [8]. Different statistical models for forecasting volatility in onion prices in selected markets of Delhi have been studied in [9].
In recent times, Machine Learning (ML) algorithms, which have developed within the data science paradigm [10], have become dominant. They have been applied to forecasting financial and economic time series [11,12]. Results of numerous empirical studies have shown that ML approaches outperform time series models in forecasting different financial assets [13]. A comparative analysis of statistical models and machine learning techniques can be found in [14]. Among the ML techniques, Artificial Neural Network (ANN), Generalized Regression Neural Network (GRNN), Support Vector Regression (SVR), Random Forest (RF) and Gradient Boosting Machine (GBM) are widely used. All these techniques are data-driven nonparametric techniques which learn the stochastic dependency in the data. It is reported that ANNs outperform classical statistical methods such as linear regression and Box-Jenkins approaches [15]. GRNN is considered to be a promising alternative to linear and nonlinear time series models [16]. Different intelligent models, namely ANN, SVR and extreme learning machine (ELM), have been applied for forecasting bean and pig grain products [17]. For series with high seasonality, [18,19] reported that machine learning and deep learning-based algorithms are efficient approaches for solving time series prediction problems. The superiority of neural networks over statistical methods in predicting agricultural prices is established in [20]. Strategies for forecasting time series using GRNN, taking advantage of its inherent properties to generate fast, highly accurate forecasts, are described in [21]. SVR was applied in predicting hog prices [22] and has been used in financial time series forecasting and exchange rate forecasting [23,24]. The GBM algorithm was applied to time series prediction tasks in coastal bridge engineering [25]. A random forest based regression model was developed in [26] to predict daily evapotranspiration from in-situ meteorological data and fluxes, satellite leaf area index (LAI) and land surface temperature data; the study found that LAI is the most important feature. An application of random forest regression for crop yield prediction may be found in [27]. For some theoretical developments in modeling time series data using random forest, one may refer to [28].
However, most of these studies have considered either weekly or monthly price data. When the data are averaged to compute weekly or monthly series, the actual variability present in the data is not truly represented. In ML algorithms, the logic of modeling is built up on the basis of the available data, depending on the purpose of the analysis. This avoids the complex and lengthy pre-modeling stage of statistical testing of various hypotheses about the studied process. The main objective of the present paper is to compare the predictive accuracies of the efficient ML algorithms GRNN, SVR, RF and GBM for forecasting the wholesale price of brinjal in major markets of Odisha, India. Unlike previous studies, daily data have been considered here, and variation in prices of brinjal in almost all the major vegetable markets of Odisha, India has been taken into consideration. The question addressed in the present investigation is which model gives the best forecast in terms of prediction accuracy. The performance comparison of the different models has been carried out from many angles, including the circular plot, the radar plot, the model confidence set, the Diebold-Mariano test and other statistical measures, e.g. RMSE, MAPE etc.

Autoregressive integrated moving average (ARIMA) model
An ARIMA(p, d, q) model for a price series $y_t$ can be written as
$$\phi(B)(1 - B)^d y_t = \theta(B)\varepsilon_t,$$
where $\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p$ and $\theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q$. In the above, $B$ is the backshift operator, i.e., $B y_t = y_{t-1}$; $p$ and $q$ are the orders of the autoregressive and moving average components respectively, $d$ is the order of differencing, and $\varepsilon_t$ is a white noise process.
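The backshift notation can be made concrete with a short sketch. The Python code below is only an illustration, not the estimation procedure used in the study: the function names, the made-up price values and the coefficient 0.5 are all assumptions. It shows how differencing and an autoregressive lag polynomial act on a series.

```python
def apply_lag_polynomial(series, coeffs):
    """Apply c(B) = 1 - c_1 B - ... - c_p B^p to a series.

    Returns c(B) y_t for every t that has p past values available.
    """
    p = len(coeffs)
    out = []
    for t in range(p, len(series)):
        value = series[t]
        for i, c in enumerate(coeffs, start=1):
            value -= c * series[t - i]  # subtract c_i * y_{t-i}
        out.append(value)
    return out


def difference(series, d=1):
    """Apply (1 - B)^d by differencing the series d times."""
    result = list(series)
    for _ in range(d):
        result = [result[t] - result[t - 1] for t in range(1, len(result))]
    return result


# Hypothetical daily prices (Rs/quintal), for illustration only.
prices = [1200.0, 1250.0, 1230.0, 1300.0, 1320.0]
first_diff = difference(prices, d=1)                    # (1 - B) y_t
ar_filtered = apply_lag_polynomial(first_diff, [0.5])   # (1 - 0.5 B)(1 - B) y_t
```

If the series followed an ARIMA(1, 1, 0) process with autoregressive coefficient 0.5, the values in `ar_filtered` would be the white noise terms.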

Support vector regression (SVR)
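For a given data set $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^n$ is the input vector, $y_i \in \mathbb{R}$ is the scalar output and $N$ is the size of the data set, the general form of the nonlinear SVR estimating function is
$$f(x) = w^T \varphi(x) + b,$$
where $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_h}$ is a nonlinear mapping function from the original input space into a higher dimensional feature space, which can be infinite dimensional, $w \in \mathbb{R}^{n_h}$ is the weight vector, $b$ is the bias term and the superscript $T$ indicates transpose. The coefficients $w$ and $b$ are estimated from the data by minimizing the following regularized risk function:
$$R(w, b) = \frac{1}{2}\|w\|^2 + C\,\frac{1}{N}\sum_{i=1}^{N} L_\varepsilon\big(y_i, f(x_i)\big).$$
In the above equation, the first term $\frac{1}{2}\|w\|^2$ is called the 'regularised term', which measures the flatness of the function. The second term, $\frac{1}{N}\sum_{i=1}^{N} L_\varepsilon(y_i, f(x_i))$, is the empirical error, estimated by the Vapnik $\varepsilon$-insensitive loss function. Both $C$ and $\varepsilon$ are user-determined hyper-parameters. Here, the Vapnik loss function is given by:
$$L_\varepsilon\big(y_i, f(x_i)\big) = \begin{cases} |y_i - f(x_i)| - \varepsilon, & \text{if } |y_i - f(x_i)| \ge \varepsilon \\ 0, & \text{otherwise,} \end{cases}$$
where $y_i$ denotes the actual value and $f(x_i)$ is the estimated value at the $i$th period. The algorithm of SVR is depicted in Fig 2. A few applications of nonlinear SVR in forecasting time series may be found in [29,30].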
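The role of the ε-insensitive loss and the regularized risk can be illustrated with a small sketch. The Python code below is a simplified illustration and not the authors' implementation: it assumes an identity feature map (a linear SVR), an empirical term averaged over the N observations, and made-up data and parameter values.

```python
def eps_insensitive_loss(y_true, y_pred, eps):
    """Vapnik loss: errors smaller than eps cost nothing."""
    return max(0.0, abs(y_true - y_pred) - eps)


def regularized_risk(w, b, X, y, C, eps):
    """0.5 * ||w||^2 + C * (mean eps-insensitive loss).

    Assumption: the feature map is the identity, so f(x) = w.x + b.
    """
    preds = [sum(wj * xj for wj, xj in zip(w, row)) + b for row in X]
    empirical = sum(
        eps_insensitive_loss(yi, pi, eps) for yi, pi in zip(y, preds)
    ) / len(y)
    flatness = 0.5 * sum(wj * wj for wj in w)
    return flatness + C * empirical


# Hypothetical toy data, for illustration only.
X_toy = [[1.0], [2.0], [3.0]]
y_toy = [1.0, 2.5, 2.0]
risk = regularized_risk(w=[1.0], b=0.0, X=X_toy, y=y_toy, C=10.0, eps=0.25)
```

A larger `C` penalizes points outside the ε-tube more heavily, while a larger `eps` widens the tube and leaves more errors unpenalized.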

Random forest (RF)
Random forest is based on the bagging technique (bootstrap aggregation) applied to decision trees [31]. Bagging reduces the variance of the base algorithms when they are weakly correlated. RF is a flexible, easy to use supervised machine learning algorithm, and it is one of the most widely used algorithms because of its simplicity and versatility. The benefits of bagging in forecasting time series have been listed in [32]. Random forest builds multiple decision trees and merges them to get a more accurate and stable prediction. In RF, the correlation between trees is reduced by randomization in two directions. Firstly, each tree is trained on a bootstrapped subset. Secondly, the feature on which splitting is performed in each node is not selected from all possible features, but only from a random subset of size m. The RF algorithm generates each of the N trees independently, which makes it very easy to parallelize, and this is where the computational efficiency of RF comes from. For each tree, it constructs a full binary tree of maximum depth. The schematic representation of RF is given in Fig 3. Here, OOB stands for the Out-of-Bag sample.
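The two sources of randomization described above can be sketched in a few lines. The Python code below is only an illustration under stated assumptions: the function names, the seed and the sizes (10 observations, 5 features, m = 2) are hypothetical, and no tree is actually grown.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; the rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


def random_feature_subset(n_features, m, rng):
    """Pick m candidate features (without replacement) for one node split."""
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(0)                       # fixed seed, for reproducibility only
in_bag, oob = bootstrap_indices(10, rng)     # rows used to grow one tree, and its OOB rows
features = random_feature_subset(5, 2, rng)  # features considered at one node
```

Each tree would be trained on the `in_bag` rows, and the `oob` rows can serve as a built-in validation sample for that tree.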

Generalized regression neural network (GRNN)
The generalized regression neural network is related to radial basis neural networks, which are founded on kernel regression. It can be treated as a normalized radial basis neural network in which there is a hidden neuron centered at every training case. These radial basis function units are generally probability density functions such as the Gaussian [33]. GRNN can approximate any arbitrary function between input and target vectors, trains fast, and converges to the optimal regression surface as the training data set becomes very large [34]. This makes GRNN a very advantageous tool for prediction. The GRNN architecture, as depicted in Fig 4, has four layers: an input layer, a hidden layer, a summation layer and an output layer. The hidden layer has radial basis neurons with the training examples as centers. The output of a hidden layer neuron depends on the nearness of the input vector to the center, scaled by the smoothing parameter [21].
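The four layers can be read as a kernel-weighted average of the training targets. The Python sketch below illustrates this with a Gaussian kernel; it is not the authors' implementation, and the function name, the toy data and the value of the smoothing parameter `sigma` are assumptions.

```python
import math


def grnn_predict(x_new, X_train, y_train, sigma):
    """Predict with a GRNN using a Gaussian kernel and smoothing parameter sigma."""
    # Hidden layer: one radial basis neuron per training case.
    weights = []
    for x in X_train:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x_new, x))
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    # Summation layer: weighted sum of targets and plain sum of weights.
    numerator = sum(w * y for w, y in zip(weights, y_train))
    denominator = sum(weights)
    # Output layer: normalized weighted average.
    # Note: a very small sigma can make all weights underflow to zero.
    return numerator / denominator


# Hypothetical toy data, for illustration only.
X_train = [[0.0], [1.0], [2.0]]
y_train = [10.0, 20.0, 30.0]
pred_mid = grnn_predict([1.0], X_train, y_train, sigma=0.5)
pred_edge = grnn_predict([0.0], X_train, y_train, sigma=0.1)
```

A small `sigma` makes the prediction follow the nearest training case, whereas a large `sigma` pulls it towards the overall mean of the targets.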

Gradient Boosting Machine (GBM)
The Gradient Boosting Machine (GBM) method proposed in [35,36] is a boosting algorithm used to obtain predictions with high predictive power when plenty of data are available. Boosting is an ensemble learning approach which combines the predictions of several base estimators in order to improve robustness over a single estimator. It sequentially builds a composition of machine learning algorithms, in which each algorithm seeks to compensate for the shortcomings of the composition of all previous algorithms. Compared to bagging, boosting does not use simple voting but a weighted one. It combines multiple weak or average predictors to build a strong predictor. In GBM, the nodes in every decision tree take a different subset of features for selecting the best split. This means that the individual trees are not all the same, and hence they are able to capture different signals from the data. Additionally, each new tree takes into account the errors made by the previous trees, so every successive decision tree is built on the errors of the previous trees. This is how the trees in a gradient boosting machine are built sequentially. The procedure is depicted in Fig 5.
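The sequential fitting on previous errors can be illustrated with a minimal sketch. The Python code below assumes squared-error loss (so the negative gradient is the residual), a single input feature and one-split trees (stumps) as base learners; the function names, the data and the values of `n_rounds` and `learning_rate` are hypothetical and it omits the feature subsampling mentioned above.

```python
def fit_stump(x, residuals):
    """Find the single split on x that minimizes the squared error of the residuals.

    Requires at least two distinct values in x.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Fit stumps sequentially, each one on the residuals of the current ensemble."""
    base = sum(y) / len(y)            # initial prediction: the mean
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [
            p + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, p in zip(x, preds)
        ]
    return base, stumps, preds


# Hypothetical toy data, for illustration only.
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_demo = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
base, stumps, fitted = boost(x_demo, y_demo)
mse_before = sum((yi - base) ** 2 for yi in y_demo) / len(y_demo)
mse_after = sum((yi - fi) ** 2 for yi, fi in zip(y_demo, fitted)) / len(y_demo)
```

The learning rate shrinks the contribution of each stump, which trades slower fitting for a lower risk of overfitting.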

Validation of forecasts
The dataset for each market was divided into two parts before analysis, with 90% of the observations used for estimation (model development) and the remaining 10% for validation. Comparative assessment of the prediction performance of the ARIMA, RF, GRNN, GBM and SVR models was carried out in terms of mean error (ME), mean absolute error (MAE), root mean square error (RMSE) and mean absolute percentage error (MAPE), based on the following formulae:
$$ME = \frac{1}{h}\sum_{i=1}^{h}\left(y_{t+i} - \hat{y}_{t+i}\right)$$
$$MAE = \frac{1}{h}\sum_{i=1}^{h}\left|y_{t+i} - \hat{y}_{t+i}\right|$$
$$RMSE = \sqrt{\frac{1}{h}\sum_{i=1}^{h}\left(y_{t+i} - \hat{y}_{t+i}\right)^2}$$
$$MAPE = \frac{1}{h}\sum_{i=1}^{h}\frac{\left|y_{t+i} - \hat{y}_{t+i}\right|}{y_{t+i}} \times 100$$
where $h$ denotes the number of observations for validation, $y_{t+i}$ is the observed value and $\hat{y}_{t+i}$ is the predicted one. The Diebold-Mariano test [37] was also conducted for different pairs of models to test for differences in predictive accuracy between any two competing models. Besides these, MCS [38,39] has been used to find the superior set of models for prediction. To strengthen our claim on the superiority of the model, circular plots and radar plots [40] have also been utilized.
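The accuracy measures and the idea behind the Diebold-Mariano comparison can be sketched as follows. This Python code is an illustration under stated assumptions, not the procedure used in the study: the errors are taken as observed minus predicted, the Diebold-Mariano statistic uses squared-error loss for one-step-ahead forecasts without any autocovariance correction, and all data values and names are hypothetical.

```python
import math


def forecast_accuracy(actual, predicted):
    """Return ME, MAE, RMSE and MAPE (in percent) over the validation observations."""
    h = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    return {
        "ME": sum(errors) / h,
        "MAE": sum(abs(e) for e in errors) / h,
        "RMSE": math.sqrt(sum(e * e for e in errors) / h),
        "MAPE": 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / h,
    }


def diebold_mariano(errors_a, errors_b):
    """Simplified DM statistic for squared-error loss and one-step-ahead forecasts.

    A negative statistic favours model A. The p-value uses the standard normal
    approximation. Fails if the loss differentials have zero variance.
    """
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(d)
    d_bar = sum(d) / n
    gamma0 = sum((di - d_bar) ** 2 for di in d) / n
    stat = d_bar / math.sqrt(gamma0 / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))  # two-sided
    return stat, p_value


# Hypothetical prices and forecasts from two models, for illustration only.
actual = [100.0, 200.0, 400.0, 250.0]
pred_a = [110.0, 190.0, 400.0, 255.0]
pred_b = [130.0, 170.0, 380.0, 280.0]
acc_a = forecast_accuracy(actual, pred_a)
err_a = [a - p for a, p in zip(actual, pred_a)]
err_b = [a - p for a, p in zip(actual, pred_b)]
dm_stat, dm_p = diebold_mariano(err_a, err_b)
```

With only four observations the normal approximation is not reliable; the example is meant to show the mechanics, not to support inference.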

Data description
The overall summary statistics of the price data are reported in Table 1. A perusal of Table 1 indicates that the average price is high in the Dhenkanal market, whereas a lower average price is observed in Bargarh. Angul experienced both the maximum price (Rs 7500/quintal) and the minimum price (Rs 250/quintal) during the study period. The kurtosis is high in almost all the markets, indicating a leptokurtic distribution. The variability in the price series, as measured by the coefficient of variation (CV), ranges from a minimum of 25.25% in Hinjilicut to a maximum of 95.41% in the Athagarh market. The Jarque-Bera test of normality indicated that prices in all the markets follow a non-normal distribution. The price and arrival of brinjal in various markets in the state of Odisha (Figs 6 and 7; Tables 2 and 3) give an indication of demand and supply. There is potential for growing all types of tropical, sub-tropical and temperate vegetables in the region. The arrival of a commodity in an individual market also depends on the infrastructure and handling capacity for that commodity in the specific market. It is also an indication of the production of the commodity in that particular region. These indications need to be taken into consideration by farmers when planning production capacity and the use of resources for the market. It is observed that Bahadajholla (220.76 ton) receives the largest volume of brinjal among the seventeen markets, followed by Angul (211.24 ton), Hinjilicut (188.95 ton), Jaleswar (117.34 ton) and Sarankul (103.17 ton). Apart from these, some markets recorded very low brinjal arrivals, i.e. Khunthabandha and Boudh (3.70 and 3.92 ton, respectively). Thus, it can be concluded that the farmers of Odisha could take the commodity to regional markets like Bahadajholla, Angul, Hinjilicut and Jaleswar. The season-wise analysis of arrivals of brinjal in the major markets of eastern India shows that in the Bahadajholla market the arrivals are low during the months of October to December and during September to November. In the Angul market, the supply of brinjal is lowest during April.

Fitting of models
The stochastic model, i.e. ARIMA, and the machine learning techniques, e.g. RF, SVR, GRNN and GBM, as described in the methodology sections, have been fitted to the data under consideration. Dependency of the current price is taken up to 5 lags for all the markets. The lagged prices were considered as the exogenous variables in the machine learning techniques. The preliminary order of the ARIMA model was selected based on the pattern of the autocorrelation function (ACF) and the partial autocorrelation function (PACF). The best fitted ARIMA model was chosen based on information criteria such as the Akaike Information Criterion (AIC), the Schwarz Bayesian Criterion (SBC) and the Hannan-Quinn information criterion (HQIC). The training of the machine learning techniques has been carried out by optimizing the parameters and hyperparameters.
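The use of five lagged prices as inputs can be sketched as follows. The Python code below is an illustration only: the function names and the price values are hypothetical, and splitting in chronological order after building the lags is an assumption, since the text states only that 90% of the observations were used for estimation and 10% for validation.

```python
def make_lagged_dataset(prices, n_lags=5):
    """Build rows [y_{t-n_lags}, ..., y_{t-1}] with target y_t."""
    X, y = [], []
    for t in range(n_lags, len(prices)):
        X.append(prices[t - n_lags:t])
        y.append(prices[t])
    return X, y


def chronological_split(X, y, train_fraction=0.9):
    """Keep the earliest observations for training and the latest for validation."""
    cut = int(len(X) * train_fraction)
    return X[:cut], y[:cut], X[cut:], y[cut:]


# Hypothetical daily prices (Rs/quintal), for illustration only.
prices = [1000.0 + 10.0 * t for t in range(25)]
X, y = make_lagged_dataset(prices, n_lags=5)
X_train, y_train, X_test, y_test = chronological_split(X, y, train_fraction=0.9)
```

Any of the regression techniques described earlier could then be trained on `X_train` and `y_train` and evaluated on the held-out rows.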

Discussions
The prediction performance measured by the four statistics ME, MAE, RMSE and MAPE, computed by the formulae described in the above section, is reported in Table 4. A perusal of Table 4 indicates that, for all the markets, the machine learning techniques perform better than the usual ARIMA model. Among the four machine learning techniques used, GRNN performs better in almost all the markets based on the above-mentioned statistical measures. The exception is the Khunthabandha market, where GBM performs better, though its gain in accuracy over GRNN and RF is not significant. The superior set of models as found by MCS is reported in Table 5. Among the 17 markets, GRNN is the superior model in 13 markets, whereas in the remaining 4 markets, namely Athagarh, Betnoti, Boudh and Khunthabandha, GRNN and RF perform at par and are both superior models. The Diebold-Mariano test was also conducted for different pairs of models to test for differences in predictive accuracy between any two competing models, and it revealed that the predictive accuracy of GRNN is better than that of the other techniques in all the markets under consideration (Table 6). In four markets, namely Athagarh, Betnoti, Boudh and Khunthabandha, the predictive accuracy of GRNN and RF does not differ significantly. To visualize the prediction performance of the different models, circular plots and radar plots have been obtained and depicted in Figs 8 and 9 respectively. For the circular plot, the 30 steps ahead forecast (one month) has been depicted, and it is observed that the GRNN prediction is closest to the test value in the majority of the markets while in few markets

Conclusions
The market arrival and price data of seventeen major markets in Odisha, India were analysed. The study of brinjal prices prevailing in major eastern Indian markets shows that the highest price prevails in the Dhenkanal market (3063.10 Rs/quintal) (Table 3). The average annual price of brinjal prevailing in Nimapara and Banki is 7.4% and 4.4% higher, respectively, than that in Hinjilicut. Thus, the Hinjilicut farmers may approach Banki, Nimapara and Dhenkanal for better price realization. The price behaviour based on the seasonal index revealed that the highest price of brinjal prevails in the month of July, followed by October and November, in the Dhenkanal market. The lowest price is observed in September and December in the Athagarh market. Thus, the Athagarh market, despite receiving a low volume of brinjal compared to the Nimapara and Banki markets, provides an opportunity for exploiting the better price prevailing there. Therefore, forecasts of the market price of agricultural produce will help farmers gain more profit by taking their products to the nearby market where better price realization prevails. Based on the historical price pattern in different markets, the forecasting models were developed, and it was found that the ARIMA model cannot capture the variation in prices over time and that its accuracy is not up to the mark. The machine learning techniques, namely RF, GRNN, GBM and SVR, performed better than the ARIMA model in all the markets. Among the machine learning techniques used in the present study, GRNN performed better than the others in the majority of the markets. The study revealed that if the model is trained properly with sufficient observations, the desired prediction accuracy can be achieved using GRNN and other ML techniques. However, the price of a commodity may depend on several exogenous factors, including weather variables, which have not been considered in the present study. Moreover, carrying agricultural produce to distant markets involves various difficulties in terms of low quantities and the related marketing risks and uncertainty. Farmers need to be empowered to aggregate their produce so as to exploit economies of scale and benefit from current institutional changes in agricultural marketing. In a future study, some nearby markets may be selected for investigating spatial dependency and incorporating that dependency in model development to check for any significant gain in accuracy.